There's still no connectivity to Facebook's DNS servers:
> traceroute a.ns.facebook.com
traceroute to a.ns.facebook.com (129.134.30.12), 30 hops max, 60 byte packets
1 dsldevice.attlocal.net (192.168.1.254) 0.484 ms 0.474 ms 0.422 ms
2 107-131-124-1.lightspeed.sntcca.sbcglobal.net (107.131.124.1) 1.592 ms 1.657 ms 1.607 ms
3 71.148.149.196 (71.148.149.196) 1.676 ms 1.697 ms 1.705 ms
4 12.242.105.110 (12.242.105.110) 11.446 ms 11.482 ms 11.328 ms
5 12.122.163.34 (12.122.163.34) 7.641 ms 7.668 ms 11.438 ms
6 cr83.sj2ca.ip.att.net (12.122.158.9) 4.025 ms 3.368 ms 3.394 ms
7 * * *
...
So they're hours into this outage and still haven't re-established connectivity to their own DNS servers.
"facebook.com" is registered with "registrarsafe.com" as registrar. "registrarsafe.com" is unreachable because it's using Facebook's DNS servers and is probably a unit of Facebook. "registrarsafe.com" itself is registered with "registrarsafe.com".
I'm not sure of all the implications of those circular dependencies, but it probably makes it harder to get things back up if the whole chain goes down. That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.
Anyway, until "a.ns.facebook.com" starts working again, Facebook is dead.
"registrarsafe.com" is back up. It is, indeed, Facebook's very own registrar for Facebook's own domains. "RegistrarSEC, LLC and RegistrarSafe, LLC are ICANN-accredited registrars formed in Delaware and are wholly-owned subsidiaries of Facebook, Inc. We are not accepting retail domain name registrations." Their address is Facebook HQ in Menlo Park.
That's what you have to do to really own a domain.
Out of curiosity, I looked up how much it costs to become a registrar. Based on the ICANN site, it is $4,000 USD per year, plus variable fees and transaction fees ($0.18 per transaction per year). Does anyone have experience or insight into running a domain registrar? Curious what it would entail (aside from typical SRE-type stuff).
Wow, I had no idea it was so cheap[1] once you're a registrar. The implication is that anyone who wants to be a domain-squatting tycoon should become a registrar. For an annual cost of a few thousand dollars plus $0.18 per domain name registered, you can sit on top of hundreds of thousands of domain names. Locking up one million domain names would cost you only $180,000 a year. Anytime someone searched for an unregistered domain name on your site, you could immediately register it to yourself for $0.18, take it off the market, and offer to sell it to the buyer at a much-inflated price. Does ICANN have rules against this? Surely this is being done?
[1] "Transaction-based fees - these fees are assessed on each annual increment of an add, renew or a transfer transaction that has survived a related add or auto-renew grace period. This fee will be billed at USD 0.18 per transaction." as quoted from https://www.icann.org/en/system/files/files/registrar-billin...
Personally saw this kind of thing as early as 2001.
Never search for free domains on the registrar site unless you are going to register it immediately. Even whois queries can trigger this kind of thing, although that mostly happens on obscure gTLD/ccTLD registries which have a single registrar for the whole TLD.
I can sadly attest to this behavior as recently as a couple years ago :(
I searched for a domain that I couldn't immediately grab (one of more expensive kind) using a random free whois site... and when I revisited the domain several weeks later it was gone :'(
Emailed the site's new owner D: but fairly predictably got no reply.
Lesson learned, and thankfully on a domain that wasn't the absolute end of the world.
I now exclusively do all my queries via the WHOIS protocol directly. Welp.
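For anyone curious, talking WHOIS directly is tiny: it's just a TCP connection to port 43, the query, and a CRLF (RFC 3912). A minimal sketch in Python, using the .com registry's WHOIS server as an example:

    import socket

    def whois(domain, server="whois.verisign-grs.com"):
        """Query a WHOIS server directly over TCP port 43 (RFC 3912)."""
        with socket.create_connection((server, 43), timeout=10) as s:
            s.sendall((domain + "\r\n").encode())
            chunks = []
            while chunk := s.recv(4096):
                chunks.append(chunk)
        return b"".join(chunks).decode(errors="replace")

    print(whois("facebook.com"))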
Probably every major retail registrar was rumored to do this at some point. Add to your calculation that even some heavyweights like GoDaddy (IIRC) tend to run ads on domains that don't have IPs specified.
Network Solutions definitely did it. I searched for a few domains along the lines of "network-solutions-is-a-scam.com", and watched them come up in WHOIS and DNS.
This is not completely accurate. The whole reason a registrar with domain abc.com can use ns1.abc.com is that glue records are established at the registry; this provides a bootstrap that keeps you out of the circular dependency. All that said, it's usually a bad idea. Someone as large as Facebook should have nameservers across zones, i.e. a.ns.fb.com, b.ns.fb.org, c.ns.fb.co, etc.
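You can see those glue records by asking a .com registry server directly. A minimal sketch using the dnspython library (assuming it's installed): the registry hands back the NS names for facebook.com in the authority section plus the glue A/AAAA records that break the chicken-and-egg.

    import socket
    import dns.flags
    import dns.message
    import dns.query

    # Ask one of the .com registry servers for the facebook.com delegation.
    gtld_ip = socket.gethostbyname("a.gtld-servers.net")
    query = dns.message.make_query("facebook.com", "NS")
    query.flags &= ~dns.flags.RD      # we want the registry's referral, not recursion
    response = dns.query.tcp(query, gtld_ip, timeout=10)

    print(response.authority)         # NS records delegating facebook.com
    print(response.additional)        # glue: IP addresses for a.ns.facebook.com, etc.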
There is always a step that involves emailing the domain's contact when a domain updates its information with the registrar. In this case, facebook.com and registrarsafe.com are managed by the same NS. You need those NS to query the MX records to send that update-approval email and unblock the registrar update. Glue records are more for performance than for breaking that loop. I may be missing something, but hopefully they won't need to send an email to fix this issue.
I have literally never once received an email to confirm a domain change. Perhaps the only exception is on a transfer to another registrar (though I can't recall that occurring, either).
To be fair, we did have to get an email from EURid recently for a transfer auth code, but that was only because our registrar was not willing to provide it.
In any case, no, they will not need to send an email to fix this issue.
I just changed the email address on all my domains. My inbox got flooded with emails across three different domain vendors. If they didn't do it before, they sure are doing it now.
This is not true when you're the registrar (as in this case). In fact, your entire system could be down and you'd still have access to the registry's system to do this update.
Facebook does operate their own private registrar, since they operate tens of thousands of domains. Most of these are misspellings, country-specific domains, and so forth.
So yes, the registrar that is to blame is themselves.
Source: I know someone within the company that works in this capacity.
> That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.
That’s not how it works. The info of whether a domain name is available is provided by the registry, not by the registrars. It’s usually done via a domain:check EPP command or via a DAS system. It’s very rare for registrar-to-registrar technical communication to occur.
Although the above is the clean way to do it, it’s common for registrars to just perform a dig on a domain name to check if it’s available because it’s faster and usually correct. In this case, it wasn’t.
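Roughly, that shortcut looks like the sketch below (dnspython assumed). The catch is that only NXDOMAIN tells you anything; a SERVFAIL or timeout, which is what facebook.com was returning during the outage, says nothing about availability.

    import dns.exception
    import dns.resolver

    def naive_availability_check(name):
        """The 'just dig it' heuristic some registrar frontends use."""
        try:
            dns.resolver.resolve(name, "NS")
            return "registered"
        except dns.resolver.NXDOMAIN:
            return "probably available"       # usually right, but not authoritative
        except (dns.resolver.NoAnswer,
                dns.resolver.NoNameservers,
                dns.exception.Timeout):
            return "unknown"                  # NOT the same thing as available

    print(naive_availability_check("facebook.com"))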
When the NS hostname is dependent on the domain it serves, "glue records" cover the resolution to the NS IP addresses. So there's no circular-dependency-type issue.
It's partially there. C and D are still not in the global tables according to RouteViews, i.e. 185.89.219.12 is still not being advertised to anyone. My peering sessions with them in Toronto have routes from them, but I'm not sure how far they're supposed to go inside their network (past hop 2 is them).
% traceroute -q1 -I a.ns.facebook.com
traceroute to a.ns.facebook.com (129.134.30.12), 64 hops max, 48 byte packets
1 torix-core1-10G (67.43.129.248) 0.133 ms
2 facebook-a.ip4.torontointernetxchange.net (206.108.35.2) 1.317 ms
3 157.240.43.214 (157.240.43.214) 1.209 ms
4 129.134.50.206 (129.134.50.206) 15.604 ms
5 129.134.98.134 (129.134.98.134) 21.716 ms
6 *
7 *
% traceroute6 -q1 -I a.ns.facebook.com
traceroute6 to a.ns.facebook.com (2a03:2880:f0fc:c:face:b00c:0:35) from 2607:f3e0:0:80::290, 64 hops max, 20 byte packets
1 toronto-torix-6 0.146 ms
2 facebook-a.ip6.torontointernetxchange.net 17.860 ms
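If you want to check advertisement status yourself without router access, RouteViews and RIPEstat expose it publicly. A rough sketch against RIPEstat's data API (the endpoint and field names here are my assumption of its current shape):

    import json
    import urllib.request

    def routing_status(resource):
        """Ask RIPEstat whether a prefix covering `resource` is announced in BGP."""
        url = ("https://stat.ripe.net/data/routing-status/data.json"
               f"?resource={resource}")
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["data"]

    data = routing_status("185.89.219.12")
    print(data.get("announced"), data.get("resource"))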
»The Facebook outage has another major impact: lots of mobile apps constantly poll Facebook in the background, so everybody who runs large-scale DNS is being slammed, with knock-on impacts elsewhere the longer this goes on.«
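This is why client-side retry policy matters: an app that retries failed lookups with capped exponential backoff and jitter degrades gracefully, while one that retries in a tight loop turns an outage into a load test for everyone else's resolvers. A rough sketch of the idea:

    import random
    import socket
    import time

    def resolve_with_backoff(host, attempts=6, base=1.0, cap=300.0):
        """Retry DNS resolution with capped exponential backoff plus jitter."""
        for attempt in range(attempts):
            try:
                return socket.getaddrinfo(host, 443)
            except socket.gaierror:
                delay = random.uniform(0, min(cap, base * (2 ** attempt)))
                time.sleep(delay)
        return None  # give up quietly instead of hammering the resolver

    resolve_with_backoff("graph.facebook.com")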
You just need to get a large enough block so that you can throw most of it away by adding your own vanity part to the prefix you are given. IPv6 really isn't scarce so you can actually do that.
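For example, Facebook's 2a03:2880::/32 allocation (my reading of public RIR records) leaves 96 bits to play with, which is how an address like 2a03:2880:f0fc:c:face:b00c:0:35 can carry the "face:b00c" vanity part and still sit inside their own prefix:

    import ipaddress

    allocation = ipaddress.ip_network("2a03:2880::/32")   # assumed allocation
    vanity = ipaddress.ip_address("2a03:2880:f0fc:c:face:b00c:0:35")

    print(vanity in allocation)          # True
    print(128 - allocation.prefixlen)    # 96 bits left over for vanity hex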
My suspicion is that since a lot of internal comms runs through the FB domain and everyone is still WFH, it's probably a massive issue just to get people talking to each other to solve the problem.
I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"
> I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"
Every internet-connected physical system needs to have a sensible offline fallback mode. They should have had physical keys, or at least some kind of offline RFID validation (e.g. continue to validate the last N badges that had previously successfully validated).
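Something like that parenthetical fallback could be as simple as the sketch below (purely illustrative, not how any real access-control product works): each door controller keeps a small cache of badge IDs it has recently validated online and accepts those for a bounded window while the backend is unreachable.

    import time
    from collections import OrderedDict

    class OfflineBadgeCache:
        """Cache the last N badge IDs that validated successfully online."""

        def __init__(self, max_entries=500, max_age_s=72 * 3600):
            self._seen = OrderedDict()   # badge_id -> time of last online success
            self.max_entries = max_entries
            self.max_age_s = max_age_s

        def record_online_success(self, badge_id):
            self._seen[badge_id] = time.time()
            self._seen.move_to_end(badge_id)
            while len(self._seen) > self.max_entries:
                self._seen.popitem(last=False)   # evict the oldest entry

        def validate_offline(self, badge_id):
            ts = self._seen.get(badge_id)
            return ts is not None and (time.time() - ts) < self.max_age_s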
I have no doubt that the publicly published post-mortem report (if there even is one) will be heavily redacted in comparison to the internal-only version. But I very much want to see said hypothetical report anyway. This kind of infrastructural stuff fascinates me. And I would hope there would be some lessons in said report that even small time operators such as myself would do well to heed.
Around here we use Slack for primary communications, Google Hangouts (or Chat or whatever they call it now) as secondary, and we keep an on-call list with phone numbers in our main Git repo, so everyone has it checked out on their laptop, so if the SHTF, we can resort to voice and/or SMS.
I remembered to publish my cell phone's real number on the on-call list rather than just my Google Voice number since if Hangouts is down, Google Voice might be too.
We don't use tapes, everything we have is in the cloud, at a minimum everything is spread over multiple datacenters (AZ's in AWS parlance), important stuff is spread over multiple regions, or depending on the data, multiple cloud providers.
Last time I used tape, we used Ironmountain to haul the tapes 60 miles away which was determined to be far enough for seismic safety, but that was over a decade ago.
One of my employers once forced all the staff to use an internally developed messenger (for the sake of security, but some politics was involved as well), but made an exception for the devops team, who used Telegram.
Why? Even if it's not DNS reliance, if they self-hosted the server (very likely) then it'll be just as unreachable as everything else within their network at the moment.
I don't think it's cocky or 20/20 hindsight. Companies I've worked for specifically set up IRC in part because "our entire network is down, worldwide" can happen and you need a way to communicate.
My small org, with maybe 50 IPs/hosts we care about, still maintains a hosts file for those nodes' public and internal names. It's in Git, spread around, and we also keep our fingers crossed.
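For anyone tempted to copy this, the fallback is nothing fancier than a small host map kept under version control and rendered into /etc/hosts format on each machine. A sketch (names and addresses below are made up):

    # Minimal fallback host map kept in Git; addresses use documentation ranges.
    FALLBACK_HOSTS = {
        "git.example.internal": "203.0.113.10",
        "vpn.example.internal": "203.0.113.11",
        "chat.example.internal": "203.0.113.12",
        "monitoring.example.com": "198.51.100.20",
    }

    def to_hosts_file(mapping):
        """Render the map as /etc/hosts lines."""
        return "\n".join(f"{ip}\t{name}" for name, ip in mapping.items()) + "\n"

    print(to_hosts_file(FALLBACK_HOSTS))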
If only IRC had been built with multi-server setups in mind, forwarding messages between servers and continuing to work if a single server, or even a set of servers, went down, just resulting in a netsplit... Oh wait, it was!
My bet is, FB will reach out to others in FAMANG, and an interest group will form maintaining such an emergency infrastructure comm network. Basically a network for network engineers. Because media (and shareholders) will soon ask Microsoft and Google what their plans for such situations are. I'm very glad FB is not in the cloud business...
> If only IRC had been built with multi-server setups in mind, forwarding messages between servers and continuing to work if a single server, or even a set of servers, went down, just resulting in a netsplit... Oh wait, it was!
yeah if only Facebook's production engineering team had hired a team of full time IRCops for their emergency fallback network...
Considering how much IRCops were paid back in the day (mostly zero, as they were volunteers) and what a single senior engineer at FB makes, I'm sure you will find 3-4 people spread around the world willing to share a 250k+ salary among them.
I worked on the identity system that chat (whatever the current name is) and gmail depend on and we used IRC since if we relied on the system we support we wouldn’t be able to fix it.
Word is that the last time Google had a failure involving a cyclical dependency they had to rip open a safe. It contained the backup password to the system that stored the safe combination.
The safe in question contained a smartcard required to boot an HSM. The safe combination was stored in a secret manager that depended on that HSM.
The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card. These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.
Safes typically have the instructions on how to change the combination glued to the inside of the door, and ending with something like "store the combination securely. Not inside the safe!"
But as they say: make something foolproof and nature will create a better fool.
Anyone remember the 90s? There was this thing called the Information Superhighway, a kind of decentralised network of networks that was designed to allow robust communications without a single point of failure. I wonder what happened to that...?
We are a dying breed... A few days ago my daughter asked me "will you send me the file on Whatsapp or Discord?". I replied I will send an email. She went "oh, you mean on Gmail?" :-D
I unfortunately cannot edit the parent comment anymore but several people pointed out that I didn't back up my claim or provided any credentials so here they are:
Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.
I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently a part of my job was to ensure that Google is prepared for various disastrous scenarios and that Googlers can coordinate the response independently of Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.
While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?
Interesting that you are asking for the dirt given that DiRT stands for Disaster and Recovery Testing, at least at Google.
Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.
While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty of tests conducted throughout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams do periodically.
Because, it may shock you to know, but sometimes people just go on the Internet and tell lies.
No shit Google has plans in place for outages.
But what are these plans, are they any good... a respected industry figure whose CV includes being at Google for 10 years doesn't need to go into detail describing the IRC fallback to be believed and trusted that there is such a thing.
I found a comment that was factually incorrect and I felt competent to comment on that. Regrettably, I wrote just one sentence and clicked reply without providing any credentials to back up my claim. Not that I try to hide my identity, as danhak pointed out in https://news.ycombinator.com/item?id=28751644, my full name and URL of my personal website are only a click away.
I've read here on HN that exactly this was the issue during one of their bigger outages (I think it was due to some auth service failure), when Gmail didn't accept incoming mail.
I think the issue there is that in exchange for solving the "one fat finger = outage" problem, you lose the ability to update the server fleet quickly or consistently.
The rate at which some Amazon services have lately gone down because other AWS services went down proves that this is an unsustainable house of cards anyway.
Sheera Frenkel
@sheeraf
Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
From the Tweet, "Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors."
> Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
Disclose.tv
@disclosetv
JUST IN - Facebook employees reportedly can't enter buildings to evaluate the Internet outage because their door access badges weren’t working (NYT)
Oh I'm sure everyone knows what's wrong, but how am I supposed to send an email, find a coworker's phone number, get the crisis team on video chat, etc., if all of those connections rely on the facebook domain existing?
Hence the suggestion for PagerDuty. It handles all this because responders set their notification methods (phone, SMS, e-mail, and app) in their profiles, so when you're in trouble nobody has to ask those questions; you just add a person as a responder to the incident.
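For reference, wiring an external escalation path like that is a small amount of glue. A rough sketch against PagerDuty's Events API v2 (the routing key and details below are placeholders, not real values):

    import json
    import urllib.request

    def page(routing_key, summary, source, severity="critical"):
        """Trigger a PagerDuty incident via the Events API v2."""
        body = {
            "routing_key": routing_key,    # placeholder integration key
            "event_action": "trigger",
            "payload": {"summary": summary, "source": source, "severity": severity},
        }
        req = urllib.request.Request(
            "https://events.pagerduty.com/v2/enqueue",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp)

    # page("R0UT1NGKEYPLACEHOLDER", "facebook.com does not resolve", "external-probe")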
Yes, but Facebook is not a small company. Could PagerDuty realistically handle the scale of notifications that would be required for Facebook's operations?
PagerDuty does not solve some of the problems you would have at FB's scale, like how do you even know who to contact? And how do they log in once they know there is a problem?
The place where I worked had failure trees for every critical app and service. The goal for incident management was to triage and have an initial escalation for the right group within 15 minutes. When I left they were like 96% on target overall and 100% for infrastructure.
Even if it can’t, it’s trivial to use it for an important subset, i.e. is facebook.com down, is the NS stuff down, etc. So there is an argument to be made for still using an outside service as a fallback.
- not arrogant
- or complacent
- haven't inadvertently acquired the company
- know your tech peers well enough to have confidence in their identity during an emergency
- do regular drills to simulate everything going wrong at once
Lots of us know what should be happening right now, but think back to the many situations we've all experienced where fallback systems turned into a nightmarish war story, then scale it up by 1000. This is a historic day, I think it's quite likely that the scale of the outage will lead to the breakup of the company because it's the Big One that people have been warning about for years.
I guarantee you that every single person at Facebook who can do anything at all about this, already knows there's an issue. What would them receiving an extra notification help with?
We kind of got off topic. I was arguing that if you were concerned about internal systems being down (including your monitoring/alerting), something like PagerDuty would be fine as a backup. Even at huge scale that backup doesn’t need to watch everything.
I don’t think it’s particularly relevant to this issue with fb. I suspect they didn’t need a monitoring system to know things were going badly.
I can imagine this affects many other sites that use FB for authentication and tracking.
If people pay proper attention to it, this is not just an average run-of-the-mill "site outage", and instead of checking on or worrying about backups of my FB data (thank goodness I can afford to lose it all), I'm making popcorn...
Hopefully law makers all study up and pay close attention.
What transpires next may prove to be very interesting.
NYT tech reporter Sheera Frenkel gives us this update:
>Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
I just got off a short pre-interview conversation with a manager at Instagram and he had to dial in with POTS. I got the impression that things are very broken internally.
How much of modern POTS is reliant on VOIP? In Australia at least, POTS has been decommissioned entirely, but even where it's still running, I'm wondering where IP takes over?
This person has a POTS line in their current location, and a modem, and the software stack to use it, and Instagram has POTS lines and modems and software that connect to their networks? Wow. How well do Instagram and their internal applications work over 56K?
The voices, stories, announcements, photos, hopes and sorrows of millions, no, literally billions of people, and the promise that they may one day be seen and heard again now rests in the hands of Dave, the one guy who is closest to a Microcenter, owns his own car and knows how to beat the rush hour traffic and has the good sense to not forget to also buy an RS-232 cable, since those things tend to get finicky.
Yeah the patch to fix BGP to reach the DNS is sent by email to @facebook.com. Ooops no DNS to resolve the MX records to send the patch to fix the BGP routers.
No. A network like Facebook's is vast and complicated and managed by higher-level configuration systems, not people emailing patches around.
If this issue even has to do with BGP, it's much more likely the root of the problem is somewhere in this configuration system and that fixing it is compounded by some other issues that nobody foresaw. Huge events like this are always a perfect storm of several factors, any one or two of which would be a total noop alone.
On the other hand, I and my office mate at the time negotiated the setup of a ridiculous number of BGP sessions over email, including sending configs. That was 20 years ago.
I don't know, I doubt it. It's just funny to think that you need email to fix BGP, but DNS is down because of BGP, and you need DNS to send email, which needs BGP. It's a kind of chicken-and-egg problem, but at a massive scale this time.
Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
You'd think they'd have worked that into their DR plans for a complete P1 outage of the domain/DNS, but perhaps not, or at least they didn't add removal of BGP announcements to the mix.
I would have expected a DNS issue to not affect either of these.
I can understand the onion site being down if Facebook implemented it the way a third party would (a proxy server accessing facebook.com) instead of actually integrating it into its infrastructure as a first-class citizen.
You can get through to a web server, but that web server uses DNS records or those routes to hit other services necessary to render the page. So the server you hit will also time out eventually and return a 500
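That cascade is easy to picture as code: the frontend handler itself is reachable, but rendering needs an internal service whose name no longer resolves, so the request hangs on the lookup and eventually comes back as a 500 (hypothetical names, purely to illustrate the failure mode):

    import socket
    import urllib.error
    import urllib.request

    def render_feed():
        """Frontend handler that depends on an internal service by DNS name."""
        try:
            with urllib.request.urlopen(
                "https://feed-api.internal.example.com/v1/feed", timeout=5
            ) as upstream:
                return 200, upstream.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            # Name resolution or routing to the backend failed: surface a 500.
            return 500, f"upstream unavailable: {exc}".encode()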
The issue here is that this outage was a result of all the routes into their data centers being cut off (seemingly from the inside). So knowing that one of the servers in there is at IP address "1.2.3.4" doesn't help, because no-one on the outside even knows how to send a packet to that server anymore.